Fault-tolerant linear solvers via selective reliability
Abstract
interface. The Failable interface has methods for marking, unmarking, and checking whether the object’s data are allowed to experience bit flips. FT-GMRES marks the relevant objects as failable on entry to the inner solver and unmarks them on exit. We describe below how we implemented this high-level interface using the low-level application / OS interface presented in Section 9.
Trilinos is built on the Petra framework of distributed linear algebra objects. Petra has two implementations: Epetra (Essential Petra) and Tpetra (Templated Petra). We use only Tpetra for our prototype, because Tpetra’s intranode parallel support library, Kokkos [6], has the features needed to support our desired programming model. In particular, Kokkos allows us to intercept allocation and deallocation of large memory arrays, called compute buffers. Linear algebra objects such as sparse matrices, vectors, and preconditioners use compute buffers exclusively to store data on which they plan to execute parallel kernels. This lets us restrict where memory faults may occur, with minimal changes to the code of the affected linear algebra objects.
Kokkos also handles intranode parallelism in a generic way that encompasses both multicore CPU and GPU-based hardware. (In fact, this is why Kokkos needs control of memory allocation: it may need to place data on a GPU or other accelerator whose memory space is separate from the CPU’s.) This lets our FT-GMRES prototype use hybrid parallelism (MPI and a threading library of our choice) without additional effort. Our software prototype currently works with multiple CPU-based threading libraries. We do not currently have GPU fault detection or injection capability, but this could be added at the level of the application / OS interface without changing our Trilinos modifications.
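
The mark-on-entry, unmark-on-exit discipline that FT-GMRES applies around its inner solver can be illustrated with a minimal Python sketch. This is not the Trilinos code: the names `Failable`, `ToyVector`, `unreliable_region`, and `inner_solve` are hypothetical, and a plain boolean flag stands in for the call into the application / OS interface.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager


class Failable(ABC):
    """Hypothetical rendering of the three-method interface described in the text."""

    @abstractmethod
    def mark_failable(self) -> None:
        """Allow memory faults in this object's data."""

    @abstractmethod
    def unmark_failable(self) -> None:
        """Require this object's data to be reliable again."""

    @abstractmethod
    def is_failable(self) -> bool:
        """Report whether faults are currently allowed."""


class ToyVector(Failable):
    """Illustrative vector; a boolean flag replaces the real OS-level marking."""

    def __init__(self, values):
        self.values = list(values)
        self._failable = False

    def mark_failable(self) -> None:
        self._failable = True

    def unmark_failable(self) -> None:
        self._failable = False

    def is_failable(self) -> bool:
        return self._failable


@contextmanager
def unreliable_region(*objects):
    """Mark objects failable on entry and unmark them on exit, even on error."""
    for obj in objects:
        obj.mark_failable()
    try:
        yield
    finally:
        for obj in objects:
            obj.unmark_failable()


def inner_solve(rhs):
    """Placeholder for the inner solver: it only reports the marking state."""
    with unreliable_region(rhs):
        # Inner iterations would run here, with rhs allowed to experience faults.
        return rhs.is_failable()
```

Using a scoped construct in this sketch means the objects return to reliable mode even if the inner solver raises, which is one way to keep the outer iteration's data protected.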
We first extended the Kokkos interface to support marking or unmarking a compute buffer as “failable.” This operation directly invokes the application / OS interface discussed in Section 9. Our Kokkos extension gives us two ways to mark failability. We may either mark or unmark all subsequent allocations of compute buffers of a particular type (e.g., double) as failable, or mark or unmark a particular compute buffer. The first option lets us experiment with faults in Tpetra-based libraries without modifying their code. (For example, we can compute the sparse matrix A reliably, then intercept final assembly so that the matrix data are stored unreliably.) The second option, marking each buffer individually, lets us extend Tpetra linear algebra objects to implement the Failable interface.
We then made Tpetra sparse matrices (CrsMatrix) and dense vectors (MultiVector) implement the Failable interface. Just like compute buffers, Failable objects may be marked or unmarked failable. Certain data in the object may experience memory faults only if the object is currently marked failable. Marking a Failable object that consists of compute buffers means marking some of its compute buffers. The object’s implementation controls which compute buffers may experience faults. For example, our sparse matrices mark only their entries, not the sparsity structure.
We can also compose more complicated Failable objects out of simpler Failable objects. For example, an ILUT incomplete factorization preconditioner consists of two sparse matrices (the L and U factors); marking the preconditioner failable means marking the L and U factors
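
The two marking modes and the composition of Failable objects can be sketched in Python as follows. All names here (`ComputeBuffer`, `BufferAllocator`, `ToyCrsMatrix`, `ToyILUT`) are hypothetical stand-ins and are not the Kokkos or Tpetra API; a boolean flag on each buffer replaces the real call into the application / OS interface.

```python
class ComputeBuffer:
    """Toy compute buffer: an array plus a flag saying whether faults are allowed."""

    def __init__(self, dtype, values, failable=False):
        self.dtype = dtype
        self.values = list(values)
        self.failable = failable


class BufferAllocator:
    """Toy stand-in for intercepted allocation (assumption, not the Kokkos API)."""

    def __init__(self):
        self._failable_types = set()

    def set_type_failable(self, dtype, failable):
        # Mode 1: affects all *subsequent* allocations of this element type.
        if failable:
            self._failable_types.add(dtype)
        else:
            self._failable_types.discard(dtype)

    def allocate(self, dtype, values):
        return ComputeBuffer(dtype, values, failable=dtype in self._failable_types)


class ToyCrsMatrix:
    """Sparse matrix in compressed-row form that marks only its entries."""

    def __init__(self, alloc, row_ptr, col_idx, entries):
        self.row_ptr = alloc.allocate(int, row_ptr)    # sparsity structure
        self.col_idx = alloc.allocate(int, col_idx)    # sparsity structure
        self.entries = alloc.allocate(float, entries)  # numerical values

    def mark_failable(self):
        # Mode 2: mark one particular buffer; the structure stays reliable.
        self.entries.failable = True

    def unmark_failable(self):
        self.entries.failable = False

    def is_failable(self):
        return self.entries.failable


class ToyILUT:
    """Composite object: marking it means marking both factors."""

    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def mark_failable(self):
        self.lower.mark_failable()
        self.upper.mark_failable()

    def unmark_failable(self):
        self.lower.unmark_failable()
        self.upper.unmark_failable()

    def is_failable(self):
        return self.lower.is_failable() and self.upper.is_failable()
```

In this sketch the per-type switch needs no change to the matrix class, while the per-buffer marking is what lets each object decide which of its buffers may experience faults.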
Journal: CoRR
Volume: abs/1206.1390
Publication date: 2012